Joseph Zemmels
Funding statement
This work was partially funded by the Center for Statistics and Applications in Forensic Evidence (CSAFE) through Cooperative Agreement 70NANB20H019 between NIST and Iowa State University, which includes activities carried out at Carnegie Mellon University, Duke University, University of California Irvine, University of Virginia, West Virginia University, University of Pennsylvania, Swarthmore College and University of Nebraska, Lincoln.
Background on Firearm & Tool Mark Exams
Cartridge Case Comparison Algorithms
Diagnostic Tools for Cartridge Case Comparisons
Automatic Cartridge Evidence Scoring (ACES)
Conclusions

Cartridge Case: metal casing containing primer, powder, and a projectile
Breech Face: back wall of gun barrel
Breech Face Impressions: markings left on cartridge case surface by the breech face during the firing process
Cartridge case recovered from crime scene vs. fired from suspect’s firearm
Place evidence under a comparison microscope for simultaneous viewing (Thompson 2017)
Assess the “agreement” of impressions on the two cartridge cases (AFTE Criteria for Identification Committee 1992)

Class characteristics: features associated with manufacturer of the firearm.
E.g., size of ammunition, width and twist direction of barrel rifling
Used to narrow the relevant population of potential firearms
Individual characteristics: markings attributed to imperfections in the firearm surface.
Subclass characteristics: markings that are reproduced across a sub-group of firearms.
E.g., barrels milled by the same machine may share similar markings
Difficult to distinguish individual from subclass characteristics
“Sufficient” agreement of class and individual characteristics suggests that the evidence originated from the same firearm (AFTE Criteria for Identification Committee 1992).
Identification: Agreement of a combination of individual characteristics and all discernible class characteristics where the extent of agreement exceeds that which can occur in the comparison of toolmarks made by different tools and is consistent with the agreement demonstrated by toolmarks known to have been produced by the same tool.
Inconclusive:
2.1 Some agreement of individual characteristics and all discernible class characteristics, but insufficient for an identification.
2.2 Agreement of all discernible class characteristics without agreement or disagreement of individual characteristics due to an absence, insufficiency, or lack of reproducibility.
2.3 Agreement of all discernible class characteristics and disagreement of individual characteristics, but insufficient for an elimination.
Elimination: Significant disagreement of discernible class characteristics and/or individual characteristics.
Unsuitable: Unsuitable for examination.

Some of these steps are performed implicitly by the examiner. For example:
“pre-processing” includes adjusting lighting on the comparison stage.
The examiner determines a similarity “score” to inform their decision.
This pipeline structure is also useful when considering automatic comparison algorithms.
National Research Council (2009):
“[T]he decision of a toolmark examiner remains a subjective decision based on unarticulated standards and no statistical foundation for estimation of error rates”
President’s Council of Advisors on Science and Technology (2016):
“A second - and more important - direction is (as with latent print analysis) to convert firearms analysis from a subjective method to an objective method. This would involve developing and testing image-analysis algorithms for comparing the similarity of tool marks on bullets [and cartridge cases].”
We introduce the Automatic Cartridge Evidence Scoring (ACES) algorithm to compare 3D topographical images of cartridge cases
Separated cartridge cases into quartets: 3 known-match + 1 unknown source
Match if fired from the same firearm, Non-match if fired from different firearms
218 examiners tasked with determining whether the unknown cartridge case originated from the same pistol as the known-match cartridge cases
| | Match Conclusion | Non-match Conclusion | Inconclusive Conclusion | Total |
|---|---|---|---|---|
| Ground-truth Match | 1,075 | 4 | 11 | 1,090 |
| Ground-truth Non-match | 22 | 1,421 | 735 + 2* | 2,180 |
*Two non-match comparisons were deemed “unsuitable for comparison”
| True Positive (%) | True Negative (%) | Overall Inconclusives (%) |
|---|---|---|
| 99.6 | 65.2 | 22.9 |

3D topographic images using Cadre\(^{\text{TM}}\) TopMatch scanner from Roy J Carver High Resolution Microscopy Facility
x3p file contains surface measurements at lateral resolution of 1.8 micrometers (“microns”) per pixel
Obtain an objective measure of similarity between two cartridge cases
Examiner takes similarity score into account during an examination
Challenging to know how/when these steps work correctly
Isolate region in scan that consistently contains breech face impressions
How do we know when a scan is adequately pre-processed?
Subsequent Pre-processing Effects:

Gaussian Filter Examples:

Erosion:

Takeaway: No current consensus on “best” pre-processing pipeline
Experimentation is needed to identify optimal parameters.

Cross-correlation function (CCF) measures similarity between scans
For two images \(A\) and \(B\), cross-correlation function \((A \star B)\) can be computed by:
\[(A \star B)[m,n] = \mathcal{F}^{-1}\left(\overline{\mathcal{F}(A)} \odot \mathcal{F}(B)\right)[m,n]\]
Maximum CCF indicates translation \([m^*, n^*]\) and rotation \(\theta^*\) at which two images align.
Index \(i,j\) maps to \(i^*, j^*\) by:
\[\begin{pmatrix} j^* \\ i^* \end{pmatrix} = \begin{pmatrix} n^* \\ m^* \end{pmatrix} + \begin{pmatrix} \cos(\theta^*) & -\sin(\theta^*) \\ \sin(\theta^*) & \cos(\theta^*) \end{pmatrix} \begin{pmatrix} j \\ i \end{pmatrix}.\]
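The FFT-based registration above can be sketched in a few lines. This is an illustrative Python/NumPy translation (the packages discussed later, cmcR and impressions, are in R); `cross_correlate` and `best_translation` are hypothetical names, not functions from those packages.

```python
import numpy as np

def cross_correlate(A, B):
    """Circular cross-correlation (A * B) via the FFT identity
    F^{-1}( conj(F(A)) . F(B) ), as in the formula above."""
    FA = np.fft.fft2(A)
    FB = np.fft.fft2(B)
    return np.real(np.fft.ifft2(np.conj(FA) * FB))

def best_translation(A, B):
    """Return the translation (m*, n*) at which the CCF is maximized,
    along with the maximized CCF value."""
    ccf = cross_correlate(A, B)
    m_star, n_star = np.unravel_index(np.argmax(ccf), ccf.shape)
    return m_star, n_star, ccf.max()
```

In the full algorithm this search is repeated over a grid of rotations \(\theta\), and the \((\theta^*, m^*, n^*)\) triple with the largest CCF is taken as the registration.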
Split one scan into a grid of cells that are each registered to the other scan (Song 2013)
For a matching pair, we assume that cells will agree on the same rotation & translation

Why does the algorithm “choose” a particular registration?
For each rotation, each cell “votes” for where it aligns best in the other scan.
For truly matching cartridge cases, the cells should “agree” on a translation at the true rotation.
In this example, \((\theta^*, m^*, n^*) = (3^\circ, 10, -10)\) appears to be the “consensus.”
Takeaway: Similar to pre-processing, very little consensus on “best” parameters
Measure of similarity for two cartridge cases
Maximized CCF (0.27 in example below) (Vorburger et al. 2007; Tai and Eddy 2018)
Congruent Matching Cells (11 CMCs in example below) (Song 2013)

What factors influence the final similarity score?
For each reference cell \(i = 1,...,N\):
1.1. Calculate the rotation that maximizes the CCF:
\[\hat{\theta}_i = \arg \max_{\theta \in \Theta} CCF_{\theta, i}\]
1.2. Return this rotation with associated CCF and translation; call them \((\hat{\theta}_i,\widehat{\Delta x}_{i}, \widehat{\Delta y}_{i},CCF_{i})\)
Estimate the consensus rotation and translation as
\[\hat{\theta} = \text{median}(\{\hat{\theta}_i : i = 1,...,N\})\]
\[\widehat{\Delta x} = \text{median}(\{\widehat{\Delta x}_{i} : i = 1,...,N\})\]
\[\widehat{\Delta y} = \text{median}(\{\widehat{\Delta y}_{i} : i = 1,...,N\})\]
The consensus values are based on each cell’s “top” vote.
A cell \(i\) is classified as a CMC if all four of the following conditions hold:
\[|\hat{\theta}_i - \hat{\theta}| \leq T_{\theta}\]
\[|\widehat{\Delta x}_{i} - \widehat{\Delta x}| \leq T_{\Delta x}\]
\[|\widehat{\Delta y}_{i} - \widehat{\Delta y}| \leq T_{\Delta y}\]
\[CCF_i \geq T_{CCF}\]
Otherwise, it is a non-CMC.
Cells are classified as CMCs if the estimated translations and rotation are close to the consensus and the associated CCF is large.
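The CMC decision rule above amounts to a median consensus plus four threshold checks. A minimal Python/NumPy sketch (thresholds here are illustrative defaults, not the published CMC parameters; `classify_cmcs` is a hypothetical name):

```python
import numpy as np

def classify_cmcs(theta, dx, dy, ccf, T_theta=3.0, T_dx=10, T_dy=10, T_ccf=0.5):
    """Classify cells as Congruent Matching Cells.

    theta, dx, dy, ccf: each cell's best rotation, translations, and CCF
    (the (theta_i, dx_i, dy_i, CCF_i) from step 1.2). Returns a boolean
    array: True where a cell is a CMC."""
    theta, dx, dy, ccf = map(np.asarray, (theta, dx, dy, ccf))
    # Consensus registration: median of each cell's "top vote"
    th_hat, dx_hat, dy_hat = np.median(theta), np.median(dx), np.median(dy)
    return ((np.abs(theta - th_hat) <= T_theta)
            & (np.abs(dx - dx_hat) <= T_dx)
            & (np.abs(dy - dy_hat) <= T_dy)
            & (ccf >= T_ccf))
```

The CMC count (the number of `True` entries) is then used as the similarity score.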
cmcR
Break down each step of the algorithm into simple “modules”
Arrange modules in-sequence with the pipe (%>%) operator

Tidy principles of design (Wickham et al. 2019):
Reuse existing data structures. Enables knowledge transfer between tool sets.
Compose simple functions with the pipe. Encourages experimentation and improvement.
Embrace functional programming. Promotes understanding of individual functions.
Design for humans. Eases mental load by using consistent, descriptive naming schemes.
Open-source code and data make algorithms accessible in terms of literal acquisition
Should be minimum standard in forensics
Encourages more transparent, equitable justice system
A “tidy” architecture makes algorithms conceptually accessible
Modularization enables experimentation and improvement
Modules are easily reordered or replaced
Lingering question: why does the algorithm work the way it does?

A number of questions arise out of using comparison algorithms
How do we know when a scan is adequately pre-processed?
Why does the algorithm “choose” a particular registration?
What factors influence the final similarity score?
We wanted to create tools to address these questions
Well-constructed visuals are intuitive and persuasive
Useful for both researchers and practitioners to understand the algorithm’s behavior

Emphasizes extreme values in scan that may need to be removed during pre-processing
Allows for comparison of multiple scans on the same color scheme
Map quantiles of surface values to a divergent color scheme




\[\mathcal{F}_{cond}(X) = \begin{cases}x_{ij} &\text{ if $cond$ is TRUE for element $i,j$} \\ NA &\text{otherwise}\end{cases}.\]
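The conditional filter \(\mathcal{F}_{cond}\) is a one-liner in array languages. An illustrative Python/NumPy version (using NaN for the NA entries; `filter_cond` is a hypothetical name):

```python
import numpy as np

def filter_cond(X, cond):
    """F_cond: keep x_ij where cond is TRUE for element i,j,
    set NA (NaN) elsewhere."""
    return np.where(cond, X, np.nan)
```

For example, `filter_cond(A, np.abs(A - B_star) <= tau)` keeps only the elements of a scan that are "similar" to its aligned mate.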


There still may be “local” similarities between two non-match surfaces
Takeaway: comparison plot even helps us understand non-match comparisons

For a matching cartridge case pair…
There should be (many) more similarities than differences
The different regions should be relatively small
The surface values of the different regions should follow similar trends
Statistics are useful for justifying/predicting the behavior of the algorithm
Ratio between number of similar vs. different observations

Compare to a non-match cell comparison:

Size of the different regions

Compare to a non-match cell comparison:

Correlation between the different regions of the two scans

Compare to a non-match cell comparison:

impressions: tidy implementation of visual diagnostic tools (x3p_filter, etc.)
geom_x3p to work better with ggplot2 functionality (Wickham 2016)
cartridgeInvestigatR: interactive web application to apply and explore comparison algorithms
Allows all audiences the ability to interact with and understand comparison algorithms
Diagnostics aid in understanding how an algorithm works and how to improve the algorithm
Can also be used to identify specific instances in which the algorithm “goes awry”
cartridgeInvestigatR provides user-friendly interface to interact with all steps of the comparison pipeline

Takeaways:
When extreme, non-BF observations are left in the scan, cells attract to “loudest” parts of the target scan.
When non-BF observations are removed, the cells seem to align in the expected grid-like pattern
Visual diagnostics can be used before or after registration to understand the effect of extreme values.


Takeaways
The middling visual diagnostics and estimated similarity score reflect the fact that neither of these pairs seem to have highly distinctive markings.
In other words, these are “unexceptional” pairs in either direction to both us as the viewer as well as to the algorithm.


Features:
From the full scan comparison:
Similarities vs. differences ratio, \(r_{\text{full}}\)
Average and standard deviation of different region sizes, \(\overline{|S|}_{\text{full}}, s_{\text{full}, |S|}\)
Different region correlation, \(cor_{\text{full}, diff}\)
From cell-based comparison:
Average and standard deviation of similarities vs. differences ratios, \(\bar{r}_{\text{cell}}, s_{\text{cell}, r}\)
Average and standard deviation of different region sizes, \(\overline{|S|}_{\text{cell}}, \bar{s}_{\text{cell}, |S|}\)
Average different region correlation, \(\overline{cor}_{\text{cell}, diff}\)

Let \(A, B \in \mathbb{R}^{k \times k}\) be two cartridge case scans and let \(d \in \{A, B\}\) denote the comparison direction, indexed by the reference scan. For \(d = A\), applying the image registration algorithm results in aligned scan \(B^*\) (and \(A^*\) for \(d = B\)).
For \(d = A\), the similarities vs. differences ratio is given by:
\[r_{A} = \frac{\pmb{1}^T I(|A - B^*| \leq \tau) \pmb{1}}{\pmb{1}^T I(|A - B^*| > \tau) \pmb{1}}\]
where \(\pmb{1} \in \mathbb{R}^k\) is a column vector of 1s. We also obtain the ratio in the other direction, yielding \(r_B\).
The full scan similarities vs. differences ratio is:
\[r_{\text{full}} = \frac{1}{2}(r_A + r_B)\].
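The per-direction ratio is simply a count of "similar" over "different" elements. An illustrative Python/NumPy sketch (`sim_diff_ratio` is a hypothetical name, not from the scored package):

```python
import numpy as np

def sim_diff_ratio(ref, aligned, tau):
    """Similarities vs. differences ratio r_d for one comparison
    direction: (# elements with |ref - aligned| <= tau) over
    (# elements with |ref - aligned| > tau)."""
    d = np.abs(ref - aligned)
    return np.sum(d <= tau) / np.sum(d > tau)

# The full scan feature averages the two directions:
# r_full = 0.5 * (sim_diff_ratio(A, B_star, tau) + sim_diff_ratio(B, A_star, tau))
```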
Next, apply a connected components labeling algorithm to the binary matrix \(|A - B^*| > \tau\) to identify the set of neighborhoods of “different” elements, \(\pmb{S}_{A} = \{S_{A,1}, S_{A,2}, ..., S_{A, L_A}\}\), where \(L_A\) is the total number of neighborhoods in direction \(d = A\). Repeat in the other direction, yielding \(\pmb{S}_B\).
Compute average and standard deviation of full scan neighborhood sizes across both comparison directions:
\[\overline{|S|}_{\text{full}} = \frac{1}{L_A + L_B} \sum_{d \in \{A,B\}} \sum_{l = 1}^{L_d} |S_{d, l}|\]
\[s_{\text{full}, |S|} = \sqrt{\frac{1}{L_A + L_B - 1} \sum_{d \in \{A,B\}} \sum _{l = 1}^{L_{d}} (|S_{d, l}| - \overline{|S|}_{\text{full}})^2}\].
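The neighborhood sizes \(|S_{d,l}|\) can be computed with any connected components labeler. A self-contained Python sketch using a simple 4-connected flood fill (real implementations would typically call a labeling routine from an image-processing library; `diff_region_sizes` is a hypothetical name):

```python
import numpy as np
from collections import deque

def diff_region_sizes(ref, aligned, tau):
    """Sizes |S_l| of the connected 'different' regions: label the
    binary mask |ref - aligned| > tau (4-connectivity) and count
    the pixels in each labeled neighborhood."""
    mask = np.abs(ref - aligned) > tau
    seen = np.zeros_like(mask, dtype=bool)
    rows, cols = mask.shape
    sizes = []
    for i in range(rows):
        for j in range(cols):
            if mask[i, j] and not seen[i, j]:
                # BFS flood fill over one connected neighborhood
                size, q = 0, deque([(i, j)])
                seen[i, j] = True
                while q:
                    r, c = q.popleft()
                    size += 1
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        rr, cc = r + dr, c + dc
                        if (0 <= rr < rows and 0 <= cc < cols
                                and mask[rr, cc] and not seen[rr, cc]):
                            seen[rr, cc] = True
                            q.append((rr, cc))
                sizes.append(size)
    return sizes
```

Pooling these sizes across both comparison directions and taking their mean and standard deviation yields \(\overline{|S|}_{\text{full}}\) and \(s_{\text{full}, |S|}\).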
Next, compute the correlation between the filtered scans \(\mathcal{F}_{|A - B^*| > \tau}(A)\) and \(\mathcal{F}_{|A - B^*| > \tau}(B^*)\) for \(d = A\) to obtain \(cor_{A, \text{full}, diff}\). Repeat in the other direction, yielding \(cor_{B, \text{full}, diff}\).
The full scan differences correlation is given by:
\[cor_{\text{full}, diff} = \frac{1}{2}\left(cor_{A, \text{full}, diff} + cor_{B, \text{full}, diff}\right)\].
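The per-direction differences correlation restricts the two scans to the elements flagged as "different" and correlates those values. An illustrative Python/NumPy sketch (`diff_region_cor` is a hypothetical name):

```python
import numpy as np

def diff_region_cor(ref, aligned, tau):
    """Correlation over the 'different' regions only: Pearson
    correlation of ref vs. aligned restricted to elements where
    |ref - aligned| > tau."""
    mask = np.abs(ref - aligned) > tau
    if mask.sum() < 2:
        return np.nan  # correlation undefined for < 2 elements
    return np.corrcoef(ref[mask], aligned[mask])[0, 1]
```

For a matching pair, the "different" regions tend to follow similar trends (e.g., a constant vertical offset), so this correlation stays high even where the surfaces disagree in absolute value.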
To accommodate cells, introduce subscript \(t = 1,...,T_d\) where \(T_d\) is total number of cells in comparison direction \(d = A,B\) that contain some non-missing values. For example, \(A_t\) denotes cell \(t\) in scan \(A\) and \(B^*_t\) its aligned mate in scan \(B^*\).
We can use the same procedures described above for the full scan comparisons, but now for each individual cell pair. We then compute summary statistics across the cells in both comparison directions to obtain comparison-level features. For example, \(r_{d,t}\) represents the similarities vs. differences ratio and \(\overline{|S|}_{d,t}\) the average labeled neighborhood size for cell \(t\) in direction \(d\).
The average and standard deviation of the cell-based similarities vs. differences ratio:
\[\bar{r}_{\text{cell}} = \frac{1}{T_A + T_B} \sum_{d \in \{A,B\}} \sum_{t = 1}^{T_d} r_{d,t}\],
\[s_{\text{cell}, r} = \sqrt{\frac{1}{T_A + T_B - 1} \sum_{d \in \{A,B\}} \sum_{t = 1}^{T_d} (r_{d,t} - \bar{r}_{\text{cell}})^2}\].
The average and standard deviation of the cell-wise neighborhood sizes:
\[\overline{|S|}_{\text{cell}} = \frac{1}{T_A + T_B} \sum_{d \in \{A,B\}} \sum_{t = 1}^{T_d} \overline{|S|}_{d,t}\],
\[\bar{s}_{\text{cell}, |S|} = \frac{1}{T_A + T_B} \sum_{d \in \{A,B\}} \sum_{t = 1}^{T_d} s_{d,t,|S|}\].
The average cell-based differences correlation:
\[\overline{cor}_{\text{cell}, diff} = \frac{1}{T_A + T_B} \sum_{d \in \{A,B\}} \sum_{t = 1}^{T_d} cor_{d,t,diff}\].
For the filter threshold \(\tau\), we use one standard deviation of the element-wise distance between a pair of aligned scans/cells. For example, for full scans \(A\), \(B^*\), we compute the standard deviation of the pooled values in \(|A - B^*|\).
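The data-driven threshold \(\tau\) is just a pooled standard deviation of the element-wise distances. An illustrative Python/NumPy one-liner (`filter_threshold` is a hypothetical name):

```python
import numpy as np

def filter_threshold(ref, aligned):
    """tau: one standard deviation of the element-wise distances
    |ref - aligned| between a pair of aligned scans or cells."""
    return np.std(np.abs(ref - aligned))
```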
For a matching cartridge case pair…
Correlation should be large at the full scan and cell levels
Cells should “agree” on a particular registration
Compute summary statistics of full-scan and cell-based registration results
Features:
Correlation from full scan comparison, \(cor_{\text{full}}\)
Mean and standard deviation of correlations from cell comparisons, \(\overline{cor}_{\text{cell}}, s_{cor}\)
Standard deviation of cell-based registration values (horizontal/vertical translations & rotation), \(s_{m^*}, s_{n^*}, s_{\theta^*}\)

For a matching cartridge case pair…
Cells should “agree” on a particular registration
The estimated registrations between the two comparison directions should be opposites

Features:
DBSCAN cluster indicator, \(C_0\)
Average DBSCAN cluster size, \(C\)
Absolute sum of density-estimated rotations, \(\Delta_{\theta}\)
Root sum of squares of the cluster-estimated translations, \(\Delta_{\text{trans}}\)





To compute the DBSCAN clusters in comparison direction \(d \in \{A,B\}\), we:
Use a 2D KDE to determine the rotation \(\hat{\theta}_d\) at which the estimated cell translations \(\{[m_{d,t,\theta}^*, n_{d,t,\theta}^*] : \theta \in \pmb{\Theta} \}\) achieve the highest density.
Apply the DBSCAN algorithm to these highest-density translations, resulting in cluster \(\pmb{C}_d \subset \{[m_{d,t,\hat{\theta}_d}^*, n_{d,t,\hat{\theta}_d}^*]\}\).
We then compute the four density-based features using \(\pmb{C}_A\) and \(\pmb{C}_B\):
\[C = \frac{1}{2}\left(|\pmb{C}_A| + |\pmb{C}_B|\right)\]
\[C_0 = I\left(|\pmb{C}_A| > 0\text{ and }|\pmb{C}_B| > 0\right)\]
\[\Delta_\theta = |\hat{\theta}_A + \hat{\theta}_B|\]
\[\Delta_{\text{trans}} = \sqrt{(\hat{m}_A + \hat{m}_B)^2 + (\hat{n}_A + \hat{n}_B)^2}\]
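Given the per-direction clustering output, the four density-based features reduce to a few arithmetic operations. An illustrative Python/NumPy sketch (`density_features` is a hypothetical name; the KDE and DBSCAN steps themselves are assumed to have been run already):

```python
import numpy as np

def density_features(clusters, thetas, centers):
    """Four density-based ACES features from per-direction output.

    clusters: dict d -> array of clustered translations (k_d x 2)
    thetas:   dict d -> density-estimated rotation theta_hat_d
    centers:  dict d -> cluster-estimated translation (m_hat_d, n_hat_d)
    """
    sizes = {d: len(clusters[d]) for d in ("A", "B")}
    C = 0.5 * (sizes["A"] + sizes["B"])          # average cluster size
    C0 = int(sizes["A"] > 0 and sizes["B"] > 0)  # cluster found in both?
    # For a true match, registrations in opposite directions should cancel
    d_theta = abs(thetas["A"] + thetas["B"])
    mA, nA = centers["A"]
    mB, nB = centers["B"]
    d_trans = np.hypot(mA + mB, nA + nB)
    return C0, C, d_theta, d_trans
```

Small \(\Delta_\theta\) and \(\Delta_{\text{trans}}\) indicate that the two comparison directions agree on (mutually inverse) registrations, as expected for a matching pair.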
Compute the 19 ACES features, \(\pmb{F}_{ACES}\), for each pairwise comparison
Use 510 cartridge cases from Baldwin et al. (2014) to fit random forest & logistic regression classifiers
Train random forest & logistic regression classifiers using 21,945 pairwise comparisons from 210 scans
Consider three nested feature sets.
Select classifier parameters that maximize AUC.
Test model on 44,850 pairwise comparisons from 300 scans
Compute true positive and true negative rates for each model
Consider distributions of similarity scores for truly matching and non-matching pairs
For each cartridge case scan pair \((A,B)\):
Label scan \(A\) as “reference” and \(B\) as “target” (the assignment is arbitrary, but useful to keep track of feature sets going forward).
Using rotation grid \(\pmb{\Theta} = \{-30^\circ, -27^\circ, ..., 27^\circ, 30^\circ\}\), perform image registration procedure to compute full scan registrations \((\theta^*_d, m^*_d, n^*_d)\) for \(d = A,B\).
Extract registered scans \(B^*\) and \(A^*\) for directions \(d = A,B\), respectively.
Compute the 5 full scan features \((cor_{\text{full}}, cor_{\text{full}, diff}, \overline{|S|}_{\text{full}}, s_{\text{full}, |S|}, r_{\text{full}})\) using the registered full scan pairs.
Perform cell-based comparison procedure using \(4 \times 4\) cell grid and rotation grids \(\pmb{\Theta}'_d = \{\theta^*_d - 2^\circ, \theta^*_d - 1^\circ, \theta^*_d,\theta^*_d + 1^\circ, \theta^*_d + 2^\circ\}\) for scan pairs \(A, B^*\) when \(d = A\) and \(B, A^*\) when \(d = B\).
Compute cell-wise estimated registrations \(\{(\theta_{d,t}^*, m_{d,t}^*, n_{d,t}^*) : t = 1,...,T_d\text{ and }d = A,B\}\).
Use cell-wise estimated registrations to extract aligned cell pairs \(\{(A_{t}, B_{t}^*) : t = 1,...,T_A\} \cup \{(B_t, A_t^*) : t = 1,...,T_B\}\).
Use aligned cell pairs to compute 5 cell-based registration features \((\overline{cor}_{\text{cell}}, s_{cor}, s_{m^*}, s_{n^*}, s_{\theta^*})\) and 5 visual diagnostic features \((\overline{cor}_{\text{cell},diff}, \bar{r}_{\text{cell}}, s_{\text{cell},r}, \overline{|S|}_{\text{cell}}, \bar{s}_{\text{cell},|S|})\).
Use 2D kernel density estimator to determine rotation \(\hat{\theta}_d\) at which cell-wise estimated translations \(\{(m_{d,t,\theta}^*, n_{d,t,\theta}^*) : \theta \in \pmb{\Theta}', t = 1,...,T_d\}\) achieve the highest density.
Use the high-density registrations to compute the 4 density-based features \((C_0, C, \Delta_{\text{trans}}, \Delta_\theta)\).

Takeaways:
Accuracy metrics improve with progressively larger feature sets
Comparable performance between random forest (RF) and logistic regression (LR) classifiers
Large difference in train/test True Positive rates
| Source | True Pos. (%) | True Neg. (%) | Overall Inconcl. (%) | Overall Acc. (%) |
|---|---|---|---|---|
| ACES LR | 95.9 | 97.8 | 0.0 | 97.7 |
| CMC Method | 74.2 | 97.7 | 0.0 | 96.1 |
| Ames I | 99.6 | 65.2 | 22.9 | n/a |
CMC method results based on our implementation in cmcR package (Zemmels, Hofmann, and VanderPlas 2022).
Ames I (Baldwin et al. 2014) compared quartets (3 to 1) and considered inconclusives.
ROC curves for four combinations of model/feature group.
Takeaways:
Feature set has larger impact on ROC/AUC than classifier model
Logistic regression (LR) model trained on full ACES set \(\pmb{F}_{ACES}\) yields lowest equal error rate.

Takeaways:
AUC is robust to DBSCAN parameter \((\epsilon, minPts)\) choice when full ACES feature set \(\pmb{F}_{ACES}\) is used.
Highest AUC attained for parameter choice \(\epsilon \approx minPts\), \(\epsilon,minPts < 10\).

Takeaways:
Cell-based cluster indicator \(C_0\) and cluster size \(C\) swap roles as more important depending on DBSCAN parameter choice
For \(minPts > \epsilon\), \(C_0\) is ranked as more important. Large \(minPts\) + small \(\epsilon\) imply stricter criteria for classifying clusters, so it’s more informative that a cluster exists than its size.
For \(minPts < \epsilon\), \(C\) is ranked as more important. Small \(minPts\) + large \(\epsilon\) imply looser criteria for classifying clusters, so the actual size of the cluster becomes more informative.
Along with AUC sensitivity plot above, it appears that the “\(C_0\) + Registration”-trained models rely heavily on \(C_0\) feature. The “All ACES”-trained models are more robust.

Takeaways:
Density features \(C_0\) and \(C\) and registration features \(\overline{cor}_{\text{cell}}\) and \(cor_{\text{full}}\) are most important.
Visual diagnostic features ranked as less important overall
We consider classification accuracy as a means of selecting/comparing models.
In practice, the examiner would use the similarity score as part of their examination.


scored package contains feature calculation/similarity scoring functionality
caret/parsnip model (Kuhn and Max 2008; Kuhn and Vaughan 2023)
Our automatic comparison pipeline is explicitly designed to be accessible in all meanings of the word
Forensics community should expect more from algorithms: they should be both effective and transparent
Code/data should be made available if at all possible
Use tidy architecture to improve comprehension and enable experimentation
Effective visual diagnostics aid in understanding and diagnosing all stages of the pipeline
Translating qualitative observations made with visual diagnostics into quantitative features naturally leads to more interpretable features
Non-trivial, yet worthwhile to develop user-friendly tools with which both programmers and non-programmers can interact
Accessible and effective algorithms lead to a more equitable and trustworthy justice system
Future work:
Generalizability of ACES
Additional stress tests (firearm and ammunition make/model, degradation levels)
What is an adequate definition of “relevant population” for F & T evidence?
Score-based likelihood ratios
Exploration of (non-)anchored approaches similar to Reinders et al. (2022)
Remedying the dependence structure (Fede & Danica)
Do trained classifiers capture both similarity and typicality (Morrison and Enzinger 2018)?
Further feature exploration
“Descriptive” features rather than “comparative” features
Speeded Up Robust Features (SURF) method and its cousins (Park and Carriquiry 2020)
2D orthonormal basis decomposition (Basu, Bolton-King, and Morrison 2022)
Characterize and segment important markings (striated vs. mottled, etc.)
Cell-based comparison is a naive way of doing this - want more targeted ways of identifying markings
Texture identification and segmentation methods (Gabor wavelets, autoencoder/generative adversarial NNs)
How useful are our tools to others?
Study how others use cartridgeInvestigatR
Improvements to visual diagnostics and cartridgeInvestigatR